Abstract

An effective way to understand the genomics of divergence in non-model organisms is to use the tran- 
scriptome to identify genes associated with divergence. We examine the transcriptome of the song 
sparrow (Melospiza melodia) and contrast it with the avian models zebra finch (Taeniopygia guttata) 
and chicken (Gallus gallus). We aimed to (i) obtain a functional annotation of a substantial portion of
the song sparrow transcriptome; (ii) compare transcript divergence; (iii) efficiently characterize single nu- 
cleotide polymorphism/indel markers possibly fixed between song sparrow subspecies; and (iv) identify 
the most common set of transcripts in birds using the zebra finch as a reference. Using two individuals 
from each of three populations, whole-body mRNA was normalized and sequenced (110 Mb total). The
assembly yielded 38 539 contigs [N50 (the length-weighted median) = 482 bp]; 4574 were orthologous
to both model genomes and 3680 are functionally annotated. This low-coverage scan of the song sparrow 
transcriptome revealed 29 982 SNPs/indels, 1402 fixed between populations and subspecies. Referencing 
zebra finch and chicken, we identified 43 and 5 fast-evolving genes, respectively. We also identified the 
most common set of transcripts present in birds with respect to zebra finch. This study provides new 
insight into songbird transcriptomes, and candidate markers identified here may help research in songbirds
(oscine Passeriformes), a frequently studied group.
1. Introduction

Determining the genetic underpinnings of organismal
divergence and speciation will provide insight
into the evolutionary generation of biodiversity, and
next-generation sequencing is propelling such
studies in non-model organisms. 1,2 An effective way
to initiate genome-wide data sets in non-model
organisms is to focus on the transcriptome, or
expressed sequence, which, unlike a whole-genome
approach, increases the data's focus on functional
genomic attributes. 3,4 As these data become available,
evolutionary biologists will be able to make contrasts
within and among lineages to identify genes
associated with divergence. 5-8 To gain insight into
the genes associated with avian diversification, we
examine the transcriptome of the song sparrow
(Melospiza melodia) and contrast it with the model
birds zebra finch (Taeniopygia guttata) and chicken
(Gallus gallus).

These authors contributed equally.

Present address: The Jackson Laboratory, 600 Main Street, Bar
Harbor, ME 04609, USA.

The song sparrow is broadly distributed across
North America and exhibits pronounced morphological
variation, with 25 subspecies recognized (of
52 described 9 ). It has been extensively studied over
the past 70 years; it is considered a model vertebrate
species for field research; and it will continue to be a 
focus for questions about the causes of population 
variation in behaviour, demographics, and morph- 
ology. 10 Our goals in this study were to (i) obtain a 
functional annotation of a substantial portion of the 
song sparrow transcriptome; (ii) compare transcript 
divergence between the song sparrow and the two 
bird genomes sequenced and assembled to the 
highest quality thus far, zebra finch (T. guttata) and 
chicken (G. gallus); (iii) efficiently characterize a set 
of single nucleotide polymorphism (SNP)/indel 
markers that may be fixed between song sparrow sub- 
species; and (iv) identify the most common set of 
transcripts present in bird species using the zebra 
finch as a reference. Achieving these goals will establish
important baseline data for a non-model organism
in a speciose and frequently studied group
(passerines or songbirds).



2. Materials and methods 

2.1. Samples, cDNA library, and sequencing

Two song sparrows still undergoing growth (from 
embryo to just-fledged) were sampled from each of 
three Alaska populations (the northwesternmost distribution
of the species), chosen because they span
some of the most pronounced morphological diver- 
sity that occurs in the species (Fig. 1): two island 
populations of M. m. maxima (from Attu and Adak 
islands; an egg and a very young nestling from Attu 
Island, unvouchered; and vouchers UAM 27831 and
27832 from Adak Island) and one mainland popula- 
tion of M. m. caurina (from Cordova, vouchers UAM 
27829 and 27830). The Attu and Adak populations
of M. m. maxima are the largest in the species
and also have different plumage coloration; in addition,
they are non-migratory, unlike the population from
Cordova, which is also smaller and darker (Fig. 1).

Figure 1. Samples in this study came from Cordova (Melospiza
melodia caurina, right in inset) and Adak and Attu islands
(M. m. maxima, left in inset); grey shading indicates the
species' range.

All samples were obtained in June (spring) at a very 
young age and only two were sexable (both females, 
one each from Cordova and Adak). The egg was 
homogenized, whereas from the others six tissues 
(brain, liver, heart, muscle, bone, and pancreas) 
were taken, minced and placed in RNAlater (Qiagen, 
Valencia, CA) within minutes of death and then 
frozen. In the laboratory, tissues were homogenized 
and total RNA was isolated using Trizol (Invitrogen, 
Carlsbad, CA) and subsequently cleaned using a 
Qiagen RNeasy column. 

Equal amounts of RNA from individuals of each 
population were pooled, and a MINT universal
cDNA kit (Evrogen, Moscow, Russia) with primers 
modified specifically for 454 procedures 11 was used 
to create cDNA libraries enriched for full-length 
transcripts. We then normalized the three cDNA li- 
braries using the TRIMMER cDNA normalization kit 
(Evrogen) to substantially decrease the relative abun- 
dance of common transcripts. The normalized cDNA 
was fragmented and prepared for sequencing using 
standard 454 procedures, including independent molecular
identifiers [MID tags: Cordova (MID 13), Attu
(MID 18) and Adak (MID 19)] for each of the three
populations. As each library contained a unique MID 
tag, libraries were pooled and sequenced as a single 
sample. Sequencing was performed at the University 
of Georgia's Georgia Genomics Facility on a Roche 
454 FLX using Titanium chemistry. 

2.2. Assembly, polymorphism, and ortholog 
identification 

Bases were called from the 454-generated sff file 
using Pyrobayes, 12 which provides improved accuracy 
in the estimation of base qualities for pyrosequences. 
We removed MINT primer sequences, short sequences,
and other contaminants using SeqClean (http://compbio.dfci.harvard.edu),
and reads from all three populations were combined.
We performed a combined assembly of reads using MIRA, 13 and then used
GigaBayes, 14 a short-read SNP and short indel discovery
program, to detect polymorphisms. To make the SNP/indel
predictions more reliable, we used the more stringent
criteria that the minor allele must occur at least
three times and be present at >10% relative to the
major allele frequency when >30 reads per locus
were obtained (after combining all the reads for particular
alleles among different subspecies; sequences
with fewer reads are considered the minor allele and
sequences with more reads are considered the major
allele). We identified orthologous contigs (against the
zebra finch and chicken genomes) using the reciprocal
blast approach, because it has been found to be superior
to sophisticated orthology detection algorithms. 15
A stringent cutoff of 1e-20 was used to separate 
paralogues from orthologues. The cDNA sequences 
from the zebra finch (taeGut3.2.4.60.cdna.all.fa) 
and chicken (WASHUC2.60.cdna.all.fa) were obtained 
from the Biomart database (www.biomart.org). 
Although the zebra finch is a passerine and thus more 
closely related to the song sparrow, the chicken database
contains sequences from whole growing chicks, whereas 
that of the zebra finch emphasizes neural transcripts. 
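
The reciprocal blast step described above can be illustrated with a minimal sketch. It assumes the blast results have already been parsed into (query, subject, e-value, bit score) tuples; the function names and the toy identifiers are hypothetical, and ranking hits by bit score is an assumption, since the text specifies only the 1e-20 cutoff.

```python
def best_hits(rows, max_evalue=1e-20):
    """Return {query: subject} keeping the highest-scoring hit per query.

    rows: iterable of (query, subject, evalue, bitscore) tuples.
    Hits with an e-value above max_evalue are ignored.
    """
    best = {}
    for query, subject, evalue, bitscore in rows:
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse, max_evalue=1e-20):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    fwd = best_hits(forward, max_evalue)
    rev = best_hits(reverse, max_evalue)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)
```

A contig whose best reference hit points back to a different contig is not reported, which is how a sketch of this kind separates putative orthologues from paralogues.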

To identify likely genomic positions of the song 
sparrow contigs, we mapped them against genomic 
sequences of the zebra finch (taeGut3.2.4.60.dna_rm.toplevel.fa)
and chicken (WASHUC2.60.dna_rm.toplevel.fa)
using BLAT 16 with default criteria. We
obtained feature information for protein-coding
genes and ncRNA using the Ensembl
(http://uswest.ensembl.org/index.html/) Xenoref and gtf files,
respectively. 

2.3. Most common set of transcripts in birds 

To find the most common set of transcripts in birds 
with respect to zebra finch, we collected and 
assembled (454 GS assembler version 2.5) the transcriptome
sequence of 12 bird species (publicly available
sequence 5,7,8,17 ). The orthologous sequence with
respect to zebra finch was determined using the bidirectional
blast best hit method (1e-20). Only contigs
>200 bp were used in the analysis. After determining 
the orthologous sequences, we sorted them in 
decreasing order and added orthologous sequences 
from other species sequentially to find the most 
common set. 
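
One plausible reading of this tallying step is sketched below: for each zebra finch transcript, count the species in which an orthologue was found, then report the transcripts shared by at least a given number of species. The data layout (a dictionary mapping a species name to the set of zebra finch transcript identifiers with an orthologue in that species) and the function names are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def species_support(orthologs_by_species):
    """Count, for each reference transcript, how many species have an orthologue."""
    counts = Counter()
    for transcripts in orthologs_by_species.values():
        counts.update(set(transcripts))  # each species counts at most once
    return counts


def transcripts_shared_by(orthologs_by_species, min_species):
    """Reference transcripts with an orthologue in at least min_species species."""
    counts = species_support(orthologs_by_species)
    return sorted(t for t, n in counts.items() if n >= min_species)
```

Raising `min_species` step by step shrinks the shared set, which mirrors the nested sets reported in Section 3.6.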

2.4. Functional annotation of contigs 

We used Blast2GO 18 (B2G) to functionally annotate
the contigs. A combined graph was generated for each
gene ontology (GO) category. For the molecular function
division, a graph was obtained using default criteria,
and for the other two divisions (cellular
component and biological process), seq/node filter
values were changed to 4/10 to prevent overloading
the graphs.



2.5. Estimation of substitution rates 

Substitution rates were estimated for contigs that 
were orthologous to both zebra finch and chicken. 
Reading frames for these contigs were identified using 
BLASTX 19 against protein sequences of zebra finch 
(taeGut3.2.4.60.pep.all.fa) and chicken (WASHUC2. 
60.pep.all.fa) obtained from Biomart (www.biomart. 
org). Sequences that produced significant alignments 
were extracted (using their coordinates), translated, 
and aligned using CLUSTALW. 20 Sequences that con- 
tained frame shifts were excluded from the analysis. 
Corresponding codon alignments were produced 
using PAL2NAL, 21 and, finally, rates were estimated 
using a maximum likelihood method implemented in 
the CODEML program of the PAML package Version
4.1. 22 Pairwise maximum likelihood analyses were performed
with runmode = -2. The estimated ratios of non-synonymous
to synonymous substitution rates (Ka/Ks
values) were plotted as a scatter plot in the range of
0-2.0.

3. Results and discussion 

3.1. Sequence assembly

The pooled reads from all three populations yielded
131 Mb (458 808 sequences) of raw data, which was
reduced to 110 Mb (381 474 sequences) after the
use of SeqClean (Table 1). The mean raw and
cleaned read lengths were 286 and 290 bp, respectively.
Poor-quality reads were often very short and
were purged entirely prior to assembly. Without a reference
genome for the song sparrow, de novo assembly
was required. Cleaned sequences were assembled into
38 539 contigs with N50 and N90 values of 482 and
317 bp, respectively (Supplementary data). There
were 1417 singletons. The mean coverage per contig
was 3.93× and the mean GC content per contig was
43.6%.
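
For readers unfamiliar with the N50 and N90 statistics, the sketch below shows how they are commonly computed. It assumes the usual convention that Nx is the length of the shortest contig in the smallest set of longest contigs that together hold at least x% of the assembled bases; the function name and the toy lengths are illustrative only.

```python
def nx(lengths, fraction):
    """Length-weighted quantile of contig lengths (N50 when fraction = 0.5)."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]  # reached only if rounding keeps the sum below threshold
```

Under this convention N90 can never exceed N50, which is consistent with the values of 317 and 482 bp reported here.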

We acknowledge that the amount of sequencing 
presented is insufficient to allow a high-quality assem- 
bly of the extremely diverse transcriptome that we have 
sampled. A large number of tissues were sampled, and 
these clearly contain a large and diverse set of transcripts
(see Section 3.2). Simulations indicate that
transcriptomes sequenced with 454 Titanium
chemistry will quickly lead to about twice as many
contigs as transcripts, and additional sequences only
gradually cause the number of contigs to reach the
number of transcripts (i.e. the point when contigs =
transcripts; data not shown). Thus, quite large
numbers of additional sequences will be necessary to
fully assemble the transcripts contained in these
cDNA libraries. Given the relatively high cost of 454 sequencing,
it would be more economical to obtain the
additional sequences as paired-end reads on Illumina
or Ion Torrent platforms.

Table 1. Number of reads and assembly statistics for three song sparrow populations (SRA048516)

Subspecies      Locality   n(a)   MID   Raw reads   Cleaned reads   Cleaned bases (Mb)
M. m. caurina   Cordova    2      13    138 439     114 098         32.5
M. m. maxima    Adak       2      19    135 588     117 166         34.7
M. m. maxima    Attu       2      18    184 781     150 210         42.8
Combined                   6            458 808     381 474         110

(a) Number of individuals pooled prior to sequencing.

3.2. Functional annotation 

B2G, which we used to functionally annotate the 
contigs, has three annotation steps involving (i) a 
blast against databases, (ii) mapping against GO 
resources, and (iii) annotation to generate reliable 
functional assignments. In our data, 12 880 of the 
contigs (33.46% overall, of which 8540 were unique 
hits) had significant matches to currently known pro- 
teins in the NCBI non-redundant protein database. 
The fact that one-third of these contigs hit the same proteins
as other contigs in our data indicates that large
transcripts were often split among multiple contigs
in our assembly. Although it is possible to use the
zebra finch or chicken proteins as a reference to scaffold
the song sparrow contigs, we did not do this
because it could create chimeras, and assembly of
full-length genes was not a major goal of this work.

As expected, zebra finch and chicken were identified 
as the top two species with the best blast hits for our 
song sparrow contigs (Table 2). Contigs with signifi- 
cant blast matches were functionally annotated. GO 
resource assignment was found for 3949 (10.2%) of
the total contigs (with 24 363 GO terms; there can be 
multiple terms per contig), of which 3367 (8.7% of all 
contigs) were functionally annotated (Supplementary 
Sheet 1). 

In the first GO division, 'biological process', 23 22 categories
were identified. Most contigs (3578 = 53.1%)
were involved in 'cellular and metabolic processes'.
The second most abundant category was 'biological
regulation and localization' (1253 = 18.6%;
Supplementary Fig. S1A). Within the second division,
'molecular function', 23 nine major categories were
identified. Most of the contigs were functionally
related to 'nucleotide binding' (1966 = 43.9%) and
'catalytic activity' (1266 = 28.2%; Supplementary Fig.
S1B). Finally, the last division, 'cellular component', 23
also had nine categories. Gene products were primarily
expressed intracellularly (2322 = 41.9%) or in the
membrane bound/non-membrane bound organelle
(1787 = 32.3%; Supplementary Fig. S1C).

Table 2. Species with >100 top hits from B2G

Species                     Hits
T. guttata                  7820
G. gallus                   2222
Homo sapiens                235
Monodelphis domestica       193
Mus musculus                187
Ailuropoda melanoleuca      177
Ornithorhynchus anatinus    149
Canis familiaris            119
M. melodia                  113
Rattus norvegicus           100

All of the GO results should be viewed with caution 
because the depth of the available sequences ensures 
that most highly expressed transcripts will have been 
sequenced but many low-expression transcripts will 
not have been detected. The normalization techni- 
ques used substantially increased the number of low- 
expression transcripts sequenced, but the number of 
sequences obtained is insufficient to overcome the 
bias toward highly expressed transcripts. 

3.3. Polymorphism detection 

We detected a total of 29 982 SNPs/indels that 
were spread relatively evenly within, between, and 
among all three populations (Fig. 2, Supplementary 
Sheet 2). A total of 1402 SNPs/indels were fixed 
between populations and subspecies (Fig. 3; the sum 
of all pairwise comparisons is 1635 because some 
pairwise SNPs are found in more than one pair). Out 
of the 1402, there were 392 and 410 SNPs/indels 
between subspecies and within-subspecies, respect- 
ively. This provides many SNPs/indels for further 
study (Supplementary Sheet 2), although given our 
limited sampling of individuals within populations 
(n = 2) many will not be true fixed differences (i.e.
they will prove to be false positives once other individuals
carrying these variants are sampled). We also note that we have used
quite stringent criteria for SNP/indel assignment. 




Figure 2. Numbers of SNPs and indels that are within and shared 
between and among three populations of song sparrows. 

Figure 3. SNPs and indels that are fixed between and among three
populations of song sparrows. There are 392 SNPs/indels that 
are identical in Attu and Adak, but different from Cordova. 
Because sample sizes are small, these figures include false 
positives. 

Figure 4. Histogram displaying the proportion of contigs mapped
to particular features of protein coding genes of zebra finch 
and chicken (UTR is the untranslated region, and CDS is the 
coding sequence). The upper panel displays the raw count and 
the lower panel normalized values (the proportion discovered 
relative to how many could be discovered within each category). 



By requiring at least three reads for the minor allele, a
minimum of sixfold coverage is required to call a
SNP. Because our average assembly depth is only
about fourfold, most polymorphic nucleotides in
our contigs will not pass our criteria for SNP discovery.
As a result, we have biased the SNPs toward
the relatively highly expressed transcripts. Many additional
SNPs/indels occur in song sparrows; we describe
only those with a high probability of being
real, not sequencing artefacts. None of these issues
limits our ability to achieve our stated goals, but we 
note them so that it is understood that we have 
made appropriately cautious interpretations of our 
results. 
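
The SNP/indel acceptance criteria from Section 2.2 can be written as a small filter. This is a minimal sketch under stated assumptions, not GigaBayes itself: the function name and the per-locus dictionary of allele read counts are hypothetical, and the 10% rule is interpreted here as the ratio of minor-allele reads to major-allele reads, applied only when a locus has more than 30 reads.

```python
def passes_snp_criteria(allele_counts, min_minor_reads=3,
                        deep_locus_reads=30, min_minor_ratio=0.10):
    """Apply the read-count criteria to one locus.

    allele_counts: {allele: number of reads supporting it}.
    """
    counts = sorted(allele_counts.values(), reverse=True)
    if len(counts) < 2:
        return False  # a single allele is not a polymorphism
    major, minor = counts[0], counts[1]
    if minor < min_minor_reads:
        return False  # minor allele must be seen at least three times
    # Assumed interpretation: at loci with >30 reads, the minor allele must
    # exceed 10% of the major allele's read count.
    if sum(counts) > deep_locus_reads and minor / major <= min_minor_ratio:
        return False
    return True
```

Because the major allele has at least as many reads as the minor allele, the smallest locus that can pass has three reads of each allele, which is the sixfold minimum coverage noted above.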

3.4. Orthology with zebra finch and chicken 

The reciprocal blast approach identified 4574 
contigs as orthologous to both zebra finch and 
chicken. As expected because of phylogenetic relationships,
more contigs were identified as orthologous
to the zebra finch than the chicken: the set [unique
song sparrow (orthologues) unique zebra finch] was
[32 435 (6104) 12 493], whereas the set [unique
song sparrow (orthologues) unique chicken] was
[32 767 (5772) 16 518]. A substantial number of orthologous
contigs (3894) were found to have the same
chromosome location in the zebra finch and chicken 
(Supplementary Sheet 1). 

3.5. Localization of contigs 

The zebra finch and chicken genomes were used as 
references to locate the contigs. BLAT mapping of our 
assemblies against these genomes showed sequences 
that uniquely mapped to particular features of the ref- 
erence genomes [5'UTR (untranslated region), 3'UTR, 
CDS (coding sequence), 1 kb upstream, 1 kb down- 
stream; Fig. 4A]. Based on the zebra finch genome 
annotation, nearly 34% of mapped contigs (2890 of 
8561) were found to be in CDS regions. Even with 
the use of the MINT cDNA construction kit, which is 
meant only to allow amplification of full-length tran- 
scripts, we still observed a substantial bias toward 
contigs mapping to 3'UTR and 1 kb downstream 
relative to 5'UTR and 1 kb upstream. The normalized
distributions clearly indicate that our libraries contain 
relatively few transcripts that are full length (Fig. 4B). 
Similar patterns, although with slightly fewer hits, were 
obtained from mapping to the chicken genome. 
The localization of contigs containing SNPs/indels 
mapped against the zebra finch and chicken genomes 
showed that a major proportion of polymorphisms 
belongs to coding sequences (Supplementary Fig. S2A 
and B). Contigs with SNPs/indels had more blast hits 
to the zebra finch than to the chicken, reflecting the 
overall pattern of all contigs. A few RNA genes were also
found by BLAT mapping (Supplementary Fig. S3A and
B).

3.6. Common set of transcripts in birds

We determined the orthologous transcripts with
respect to zebra finch using the bidirectional blast
best hit method in 12 bird species. From the orthologous
sequences, we determined the most common
set of zebra finch transcripts, i.e. those present in
all or most of the species. The largest set of
transcripts (1004 zebra finch sequences) was
present in seven bird species. Smaller sets
comprised 219 and 126 sequences present in 10
and 12 bird species, respectively, and, finally, 19
sequences were present in all 13 species. Detailed
information regarding species used and orthologous
sequences is given in Supplementary Sheet 3.
Further, we checked the pathways in which these 
common transcripts might be involved using 
DAVID 24,25 and found that they mainly related to 
oxidative phosphorylation, ribosome biogenesis, and 
cardiac muscle contraction. These are housekeeping
genes, 26,27 which explains their frequent occurrence
in all avian species. With respect to the chromosomal
location of common transcripts, we did not
find any significant bias related to any particular 
chromosome. 

3.7. Estimation of Ka/Ks

Substitution rates were estimated for the 4574
contigs orthologous to both zebra finch and chicken.
After filtering (based on the length of alignment and
removing frame shifts), the number of contigs was
reduced to 3821. We excluded contigs that were
either identical or which had Ks = 0 (which made
Ka/Ks incalculable). Thus, Ka/Ks was estimated for
3252 (zebra finch) and 3127 (chicken) contigs.
Rate estimation with zebra finch identified 43
contigs with Ka/Ks >1 and 283 with values of 0.5-1.0
(Fig. 5A). Rate estimations with chicken yielded
5 and 58 contigs with Ka/Ks >1 and between 0.5
and 1.0, respectively (Fig. 5B). Afterwards, assuming
the song sparrow contigs have the same chromosome
organization as zebra finch and chicken, the calculated
ratios were organized into chromosomes
(Table 3); this is not an unrealistic assumption considering
the high degree of chromosomal conservation
among avian genomes 28,29 and the fact that such a
high proportion (85.1%) of our orthologous contigs
was found to have shared chromosomal locations
with zebra finch and chicken.
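
The exclusion and binning of ratios described at the start of this section can be sketched as follows. The input layout (tuples of contig name, Ka, and Ks), the function name, and the bin labels are hypothetical, and treating the 0.5 and 1.0 boundaries as belonging to the middle bin is an assumption, since the text does not say how boundary values were handled.

```python
def bin_ka_ks(estimates):
    """Sort pairwise estimates into the categories used in the text.

    estimates: iterable of (contig, ka, ks) tuples.
    Contigs with ks == 0 (including identical sequences) are set aside
    because the ratio cannot be calculated.
    """
    bins = {"above_1": [], "0.5_to_1": [], "below_0.5": [], "excluded": []}
    for contig, ka, ks in estimates:
        if ks == 0:
            bins["excluded"].append(contig)
            continue
        ratio = ka / ks
        if ratio > 1.0:
            bins["above_1"].append(contig)
        elif ratio >= 0.5:
            bins["0.5_to_1"].append(contig)
        else:
            bins["below_0.5"].append(contig)
    return bins
```

Counting the entries in `above_1` and `0.5_to_1` for each reference species would give tallies analogous to the 43 and 283 (zebra finch) and 5 and 58 (chicken) reported above.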

Although Ka/Ks (sometimes calculated as dN/dS or
ω) is commonly misinterpreted, 30 this ratio of rates
of non-synonymous to synonymous substitutions
can give some context to candidate genes and
allows for subsequent hypothesis testing. 31,32 Data
organized into chromosomes suggest that contigs
may have undergone more selection with respect to
the zebra finch than the chicken (as high Ka/Ks
values are typically interpreted, though see ref. 30).

The fact that Ka/Ks values were higher on average
for the zebra finch than for the chicken (Table 3) is
likely a methodological artefact. The zebra finch is in
the same taxonomic order as the song sparrow
(Passeriformes), whereas the chicken is taxonomically
distant (Galliformes). Estimates of ω necessarily classify
sites with differences as non-synonymous or synonymous,
and errors in the estimation of either can
profoundly affect the outcome of these analyses. 33

Figure 5. The distribution of Ka/Ks ratio for the contigs orthologous
to both zebra finch (A) and chicken (B). Contigs with Ka/Ks
values of 0.5-1.0 fall above the grey line and values >1.0 fall
above the black line.

Table 3. Number of contigs orthologous to particular zebra finch and chicken chromosomes, and mean Ka/Ks ratio for each
chromosome, assuming the orthologous contigs have the same chromosomal location as zebra finch and chicken

Contigs: contigs orthologous to the particular chromosome of that species.
Transcripts: total number of transcripts from the particular chromosome of that species in the Biomart file.

       Zebra finch                                Chicken
Chr    Contigs  Transcripts  Ka/Ks (mean ± SD)    Contigs  Transcripts  Ka/Ks (mean ± SD)
1      261      1124         0.2552 ± 0.2733      492      2994         0.1528 ± 0.1694
2      338      1345         0.2434 ± 0.2465      339      1995         0.1457 ± 0.1326
3      309      1169         0.2434 ± 0.2807      314      1672         0.1565 ± 0.1497
4      188      741          0.2258 ± 0.3347      252      1516         0.1374 ± 0.1274
5      229      936          0.2103 ± 0.2184      234      1299         0.1280 ± 0.1219
6      107      562          0.2447 ± 0.2112      106      781          0.1486 ± 0.1187
7      124      521          0.2220 ± 0.2103      120      767          0.1361 ± 0.1235
8      111      416          0.2581 ± 0.2196      127      723          0.1436 ± 0.1251
9      90       458          0.2286 ± 0.3839      86       598          0.1045 ± 0.1087
10     86       394          0.1784 ± 0.1738      90       599          0.1220 ± 0.1890
11     68       371          0.2330 ± 0.2978      61       499          0.1429 ± 0.1439
12     73       349          0.1799 ± 0.2206      68       427          0.1076 ± 0.1122
13     77       321          0.1845 ± 0.2319      83       499          0.0994 ± 0.1225
14     80       390          0.2541 ± 0.3448      79       578          0.1333 ± 0.1288
15     76       350          0.1817 ± 0.2299      73       531          0.0925 ± 0.1207
17     49       300          0.1705 ± 0.1597      46       432          0.0967 ± 0.0861
18     54       309          0.2230 ± 0.1950      55       428          0.1085 ± 0.0907
19     68       313          0.2004 ± 0.2982      66       443          0.0858 ± 0.0952
20     50       329          0.2419 ± 0.2444      51       476          0.1336 ± 0.1277
21     34       192          0.1470 ± 0.1569      44       346          0.0847 ± 0.1058
22     16       98           0.1000 ± 0.0976      11       160          0.0441 ± 0.0593
23     34       205          0.1783 ± 0.1828      33       288          0.0782 ± 0.0920
24     27       181          0.1961 ± 0.1906      24       270          0.1000 ± 0.0982
25     7        92           0.1161 ± 0.1069      6        169          0.0711 ± 0.1017
26     31       176          0.1148 ± 0.1081      29       341          0.0824 ± 0.0927
27     31       252          0.1471 ± 0.1438      28       345          0.0698 ± 0.0727
28     27       227          0.1102 ± 0.1256      23       284          0.0476 ± 0.0414
Z      149      745          0.2321 ± 0.2293      146      990          0.1381 ± 0.1174

Taxonomic or lineage distance (longer branches) will 
affect the reconstruction of synonymous substitution 
rates especially (through an expected increase in 
repeated mutations, or multiple hits), and we con- 
sider this to be a likely source of the consistent differ- 
ences in apparent molecular selection between our 
song-sparrow-to-zebra-finch and song-sparrow-to- 
chicken contrasts (Table 3; see also ref. 34). 
Nevertheless, these contrasts are valuable in highlighting
the chromosomal distributions (assuming
chromosomal stability 28 ) and relative values of ω
between closer and more distant relatives of the
song sparrow, providing insights into attributes of selection
in the coding genome across these scales.
Unfortunately, this approach is not valid within
species. 35-37

Chromosomes 22 and 26 showed the greatest dif- 
ferences between the zebra finch and the chicken in 
the percentage of song sparrow contigs mapped (rela- 
tive to the number of genes available in the Biomart 
database for the zebra finch and chicken). Both of 
these chromosomes had significantly different frequencies
of mapped-song-sparrow versus Biomart
data-available genes between the zebra finch and
the chicken (Gadj = 4.4, P < 0.05, and Gadj = 6.9, P <
0.01, respectively, at 1 d.f., G-test with Williams' correction;
Table 3). In both cases, proportionally more
contigs were mapped to the zebra finch than to the
chicken given the sizes of the respective databases
(Table 3).
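
The G-test with Williams' correction used for these comparisons can be sketched for a 2 × 2 table as follows. The layout of the table (one row per reference species, with the number of mapped contigs and the number of Biomart transcripts as columns, both taken from Table 3) is an assumption about how the test was set up; with that layout the sketch gives values that round to the reported 4.4 and 6.9. The function name is illustrative, and the P value uses the closed form for a chi-square distribution with 1 d.f.

```python
import math


def g_test_williams_2x2(table):
    """Return (G_adj, P) for a 2 x 2 table of observed counts."""
    if len(table) != 2 or any(len(row) != 2 for row in table):
        raise ValueError("expected a 2 x 2 table")
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    n = sum(rows)
    # Log-likelihood ratio statistic: G = 2 * sum(O * ln(O / E))
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = rows[i] * cols[j] / n
                g += 2.0 * observed * math.log(observed / expected)
    # Williams' correction factor q for a 2 x 2 table
    q = 1.0 + ((n * sum(1.0 / r for r in rows) - 1.0)
               * (n * sum(1.0 / c for c in cols) - 1.0)) / (6.0 * n)
    g_adj = g / q
    # Upper tail of chi-square with 1 d.f.
    p = math.erfc(math.sqrt(g_adj / 2.0))
    return g_adj, p


# Chromosome 22 and 26 counts from Table 3: [zebra finch, chicken],
# each as [mapped contigs, Biomart transcripts].
chr22 = [[16, 98], [11, 160]]
chr26 = [[31, 176], [29, 341]]
```

Calling `g_test_williams_2x2(chr22)` or `g_test_williams_2x2(chr26)` returns the adjusted statistic and its P value for the corresponding chromosome.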

3.8. Chromosomal distributions of between-subspecies 
SNPs/indels 

Two findings emerged in comparing the among- 
chromosome locations (mapped against the zebra 
finch) of the between-subspecies SNPs/indels that 
were mapped to chromosomes (218 SNP/indel- 
bearing, between-subspecies song sparrow contigs; 
Supplementary Sheet 2) versus all orthologous song 
sparrow contigs (Table 3). First, the chromosomal distribution
of the candidate loci was significantly different
from the distribution of all orthologous contigs
(Gadj = 51.5, 27 d.f., P < 0.005), indicative of a non-random
process (e.g. selection). Importantly, the
chromosomal distribution of the 199 unique, mappable
SNP/indel-bearing contigs between Attu and
Adak islands (within the subspecies maxima), where
we expected drift rather than selection to be more
pronounced, was not significantly different from the
chromosomal distribution of all orthologous contigs
(Gadj = 35.1, 27 d.f., P > 0.1). Secondly, the greatest
differences in the distribution of between-subspecies
candidate loci from the distribution of all contigs occurred
among chromosomes 2, 5, and Z (where proportionally
fewer SNP/indel-bearing contigs occurred
than expected) and chromosomes 3 and 11 (where
relatively more SNP/indel-bearing contigs occurred
than expected).

Finally, in contrasting our between-subspecies 
results with those of our between-species compari- 
sons above, we found that seven of the SNP/indel- 
bearing contigs between subspecies were also 
contigs that exhibited evidence suggestive of selection 
(high K a /K s values) when compared with the zebra 
finch and the chicken. Each contig has one between- 
subspecies SNP, and the functions of these loci are 
variable (Supplementary Sheet 4). Three of these 
seven occurred on chromosome 3 and one on 
chromosome 11, where the between-subspecies contrasts
suggested elevated levels of SNPs/indels. These
contigs and their chromosomal locations may thus 
be important in songbird divergence, but we do not 
yet know why. 

3.9. Summary 

In summary, our analysis identified the major 
categories of song sparrow genes and orthologous 
loci between song sparrow/zebra finch and 
song sparrow/chicken. Substitution rate estimation 
yielded the fastest evolving loci, and some of the loci 
that were fixed between subspecies were also high- 
lighted as possibly under selection between the song 
sparrow and the zebra finch. Although additional
sequencing of these libraries and validation of
within-species SNPs/indels in multiple populations 
and lineages is required, we consider that the loci 
described here will include some of broad utility for 
studying the genomics of songbird divergence. 